Introduction

Study Topic & Questions of Interest

For our project we chose to perform an exploratory analysis and visualization on a dataset containing the historical auction prices of thousands of pieces of art. We chose this topic for a variety of reasons—whether you come from the arts and humanities, or economics and finance, art is an interesting and complex field. With new data starting to become available, for the first time we have the ability to analyze and better understand this historically opaque industry. It is a potential treasure trove for data scientists.

For hundreds of years fine art has been valued by cultures throughout the world for its aesthetic and beauty, but fine art as an asset class also appeals to many people today as an attractive alternative investment that is largely non-correlated to the traditional stock market. The fine art asset class is estimated to have a global market value of over $60 billion in annual turnover and has historically outperformed the S&P 500 in terms of compound annual growth (specifically, 15-year CAGR). Geographically, the art market has for decades been dominated by the United States, United Kingdom and China with the greatest concentration of art sales in the entire world occurring right here in New York City.

Each of our team members has a diverse background. In addition to experience as data scientists, a few of us have a background in economics and have worked for financial services companies. A few of us have also worked in the art industry for galleries or collectors. This topic was a cross-section of our experiences and interests that allowed us to explore timely questions about the prices and dynamics of the fine art market.

From a high level, the questions that intrigued us most were related to knowing more about the auction prices (called hammer price) for pieces of art in terms of different artists, location and time periods. We initially developed a large list of potential questions, and narrowed these down based on viability in terms of the data we had available as well as the scope and time constraints for the project. Some of the specific questions we decided to look at are:

  1. What is the distribution of quantity of lots and auctions by location, year, and season?
  2. What are the lot titles that have higher price/appear more?
  3. How is price related to artist’s era/when he was born?
  4. Do auction prices vary by location and season?
  5. Did Financial Crisis have any effect on the average auction price by location?
  6. Does the order in which the lot is presented affect the overall sell price?

Team Members & Contributions

Each team meber worked equally hard to bring this project to live. While having weekly progress meetings and brainstorming sessions, we have adopted a divide and concour approach around buildings graphs and the corresponding write-up.

Introduction as well as Data Description were written by Basil Vetas. As a full time employee of Arthena, Basil was also intrumential in helping us collect the dataset used in the project. In addition, Basil worked on the interactive component of the assignment.

Data Quality analysis as well as some of the clean-up was performed by Alexandra Sudomoeva. Alexandra also built visualizations for the Financial crisis impact for the main analysis. Lastly, she assisted in writing the instructions for the interactive component.

Serena Zhang worked on data preprocessing and the main analysis as well as the executive summary. Her analysis included auction price correlation with lot titles, artist’s era, order in auction, season and location.

The main contributor to the analysis around basic information was Elizabet Doliar. She also worked at analyzing project titles and creating an auction price summary by building a parrallel coordinates plot. Lastly, Elizabet helped put together the executive summary.

Data Description

Overview

This dataset was provided by Arthena (https://arthena.com/). It was sourced from Sotheby’s historical auction data via the scraping of public web pages. The raw dataset includes 22711 rows of data. Each row represents an individual lot from an auction (a lot could be an individual painting, a sculpture, or sometimes even a collection of works). The raw dataset includes 25 columns of data. Each column represents a feature related to either that specific lot, the artist of the lot, or the auction where the lot was sold. Column definitions are listed below by category (Lot, Auction or Artist). For our analysis we also derive new columns from the original raw dataset. The derived column definitions are also listed below. Including these columns we ended with a total of 36 columns of data. The detailed logic and thought process is described more in the ‘Data Quality Analysis’ section.

art = read.csv("final_sothebys.csv", header=TRUE, na.strings=c("", NA))
#Save as a data frame
art_df <- as.data.frame(art)

Column Definitions:

#Dropping features with no interest (non-informative) in the analysis
drop <- c("X", "Unnamed..0", "provenance", "auction_house_id", "external_image_url", "literature", "end_date")
art_df <- art_df[, !(names(art_df) %in% drop)]

# Note that "provenance", "auction_house_id", "external_image_url", "literature" and "end_date" were dropped
names(art_df)
##  [1] "lot_id"                    "auction_id"               
##  [3] "lot_title"                 "estimate_low"             
##  [5] "estimate_high"             "hammer_price_bp"          
##  [7] "currency"                  "nth_in_auction"           
##  [9] "lot_number"                "condition"                
## [11] "auc_title"                 "number_of_lots"           
## [13] "location"                  "start_date"               
## [15] "auc_desc"                  "sale_id"                  
## [17] "artist_id"                 "name"                     
## [19] "birth_year"                "death_year"               
## [21] "auc_date"                  "auc_year"                 
## [23] "auc_month"                 "auc_year_month"           
## [25] "auc_year_month_date"       "auc_season_num"           
## [27] "auc_season"                "is_untitled"              
## [29] "auc_order"                 "percent_in_auction"       
## [31] "estimate_avg"              "auth_era_num"             
## [33] "auth_era"                  "hammer_price_bp_usd"      
## [35] "estimate_low_usd"          "estimate_high_usd"        
## [37] "estimate_avg_usd"          "hammer_price_bp_usd_range"

Lot
lot_id: a unique id for each lot.
lot_title: the title of the lot. A lot can sometimes consist of multiple pieces of art. We assume that 1 piece is 1 lot since that is most common.
estimate_low: the low-end auction price estimate for a lot, given by Sothebys.
estimate_high: the high-end auction price estimate for a lot, given by Sothebys.

hammer_price_bp: how much the lot was sold for at auction, plus buyers premium (a percentage fee taken by Sothebys and paid by the buyer).
currency: currency denomination for the price estimates and hammer price (limited to USD, EUR, GBP, HKD).
nth_in_auction: the order that the lot was presented in at auction.
lot_number: a number assigned to a lot for the given auction, different than nth_in_auction.
condition: description of the condition of the lot (messy text field - not used for our analysis).
provenance: description of who owned the lot previously (messy text field - not used for our analysis).
literature: different publications that the lot was mentioned in (messy text field - not used for our analysis).
external_image_url: link to the image (not used for our analysis).

Auction
auction_house_id: unique id for each auction house (for this dataset, all 1 since we are only using Sothebys data).
auction_id: unique id for each auction.
auc_title: title of the auction.
number_of_lots: total number of lots in the auction.
location: location where the auction was held.
start_date: start date of the auction.
end_date: end date of the auction (same as start date for this dataset).
auc_desc: description of the auction (messy text field - not used for our analysis).

sale_id: unique auction sale id assigned by Sothebys.

Artist
artist_id: unique id for each artist.
name: name of the artist.
birth_year: artist’s approximate birth year (messy text field - not used for our analysis).
death_year: artist’s approximate death year (messy text field - not used for our analysis).

Derived
estimate_avg: the average between estimate_low and estimate_high.
is_untitled: an indicator variable whether the name of the lot is “untitled” (in some language).
auc_year: the year of the auction (YYYY format).
auc_month: the month of the auction (as integers 1-12).
auc_season: the season of the auction (as integers 1-4).
auc_date: the date of the auction.
auc_order: the order in the auction by quantiles (as integers 1-10).
percent_in_auction: the percentage through an auction that a lot was shown (nth_in_auction divided by number_of_lots).
hammer_price_bp_usd: hammer_price_bp converted to usd.
estimate_high_usd: estimate_high converted to usd.
estimate_low_usd: estimate_low converted to usd.
estimate_avg_usd: estimate_avg converted to usd.
price_range: factor of bucketed hammer price ranges in usd.

# data types
sapply(art_df, class)
##                    lot_id                auction_id 
##                 "integer"                 "integer" 
##                 lot_title              estimate_low 
##                  "factor"                 "numeric" 
##             estimate_high           hammer_price_bp 
##                 "numeric"                 "numeric" 
##                  currency            nth_in_auction 
##                  "factor"                 "integer" 
##                lot_number                 condition 
##                 "integer"                  "factor" 
##                 auc_title            number_of_lots 
##                  "factor"                 "integer" 
##                  location                start_date 
##                  "factor"                  "factor" 
##                  auc_desc                   sale_id 
##                  "factor"                  "factor" 
##                 artist_id                      name 
##                 "integer"                  "factor" 
##                birth_year                death_year 
##                 "numeric"                  "factor" 
##                  auc_date                  auc_year 
##                  "factor"                 "integer" 
##                 auc_month            auc_year_month 
##                 "integer"                  "factor" 
##       auc_year_month_date            auc_season_num 
##                  "factor"                 "integer" 
##                auc_season               is_untitled 
##                  "factor"                 "integer" 
##                 auc_order        percent_in_auction 
##                 "integer"                 "numeric" 
##              estimate_avg              auth_era_num 
##                 "numeric"                 "numeric" 
##                  auth_era       hammer_price_bp_usd 
##                  "factor"                 "numeric" 
##          estimate_low_usd         estimate_high_usd 
##                 "numeric"                 "numeric" 
##          estimate_avg_usd hammer_price_bp_usd_range 
##                 "numeric"                  "factor"

Data Quality Analysis

Preprocessing

In this part of the project, we will be exploring the quality of the data provided for the auction data. Before analyzing the overall quality, a simple preprocessing has been conducted in Python that included the following steps:

  1. Added columns for normalized date (whole), year, month, and season.
  2. ‘Start Date’ and ‘End Date’ are always the same within the dataset. Therefore, we decided to only user ‘Start Date’ for the time when the auction occured.
  3. The “Title” column accepted a varity of title across different languages. We added a new column to show whether the piece is “untitles” while checking for the top 5 most common languages used: Italian, Dutch, English, French, Spanish, German.
  4. Created a new column that would break the order of the auction into 10 tiles and group all pieces into those tiles by auction id
  5. Added ‘Average Estimate’ column to reflect the aevrage price estimation between the high and the low
  6. Converted all currency to USD to be able to compare pieces sold across different locations. We matched the exchange rate at the time of the sell to properly convert all transactions.

A complete logic with code can be found here: https://github.com/serenazzz/art-auction-visualizatoin-project/blob/master/Preprocessing_final.ipynb

Data Quality Exploration

Let us start by looking at all the missing values and if there are any general patterns. Since we are dealing with a large amount of rows, we will reduce repeated patters to one row using the visna function.

#install.packages("extracat")
library(extracat)
visna(art_df, sort= "b")

Looking at the output above, we can conclude that the overall state of the dataset is relatively good. The second most common pattern has no missing values while the two most “problematic” features appear to be death_year and birth_year.

One important observation we will need to be careful about during further analysis is a relatively significant amount of missing values under hammer_price_bp feature. We have explored qualitative reasons behind the missing values in that category with the provider of the data. After careful observation, we found out that the missing values are actually driven by two locations that migt have less strict regulations around data governance (Doha). This observation is outlined in the graph below.

#install.packages("viridis")
library(viridis)
art_location <- art_df %>% gather(attribute, value, -location) %>% mutate(missing = ifelse(is.na(value), "yes", "no"))
ggplot(art_location, aes(x = location, y = attribute, fill = missing)) +
  geom_tile(color = "white") +
  ggtitle("Missing Values by Location") +
  xlab("Location") + ylab("Feature Name") +
  scale_fill_viridis(discrete=TRUE) +
  theme_bw()

This map allows us to see that some features’ missing values like hammer_price_bp, condition, and auc_desc are actually only missing in certain location (typically in pairs of two). Therefore, we can speculate that this data is MAR (missing at random) depending on a location feature.

Indeed, when looking at the table below, the overall percentage of missing values across difeerent locations varies quite significantly. Hong Kong is the top location with most missing values (8% of total).

art_location <- art_df %>% gather(attribute, value, -location)
percent_missing <- art_location %>% group_by(location) %>%
  summarise(num_na = sum(is.na(value)), total = n()) %>%
  mutate(percent_na = num_na/total)%>%
  arrange(-percent_na)
percent_missing
## # A tibble: 8 x 4
##   location  num_na  total percent_na
##   <fct>      <int>  <int>      <dbl>
## 1 AMSTERDAM   6638  81992     0.0810
## 2 HONG KONG   1189  16872     0.0705
## 3 PARIS       6156  94054     0.0655
## 4 MILAN       2866  51097     0.0561
## 5 DUBAI         90   1998     0.0450
## 6 NEW YORK   15565 357531     0.0435
## 7 LONDON      9708 233507     0.0416
## 8 DOHA          98   2479     0.0395

This leads to a logical question. Are there any other variables that could explain the missing data? Therefore, we also looked at simular graphs while grouping by year and month.

#install.packages("viridis")
#library(viridis)
art_year <- art_df %>% gather(attribute, value, -auc_year) %>% mutate(missing = ifelse(is.na(value), "yes", "no"))
year <- ggplot(art_year, aes(x = factor(auc_year), y = attribute, fill = missing)) +
  geom_tile(color = "white") + 
  ggtitle("Missing Values by Year") +
  xlab("year") + ylab("Feature Name") +
  scale_fill_viridis(discrete=TRUE) +
  theme_bw()
art_month <- art_df %>% gather(attribute, value, -auc_month) %>% mutate(missing = ifelse(is.na(value), "yes", "no"))
month <-ggplot(art_month, aes(x = factor(auc_month), y = attribute, fill =       
                                missing)) +
  geom_tile(color = "white") + 
  ggtitle("Missing Values by Month") +
  xlab("Month") + ylab("Feature Name") +
  scale_fill_viridis(discrete=TRUE) +
  theme_bw()
grid.arrange(year, month)

While month looks more or less equally distributed, there is an interesting pattern forming in the year view. It comes as no surprise that older years (2006-2009) would carry all of the missing values for features around description and condition. There also looks to be certain years (more like 3 year periods) with perfectly clean data: 2010-2012 and 2015-2017.

Next, let us look at the exact percentages and values for the overall missing data.

#install.packages("skimr")
library(skimr)
skimr::skim(art_df) %>% filter(stat =="missing") %>% arrange(desc(value)) %>% select(variable, value) %>% mutate(percent = value/nrow(art_df)) %>% filter (percent>0)
## # A tibble: 8 x 3
##   variable            value percent
##   <chr>               <dbl>   <dbl>
## 1 death_year          14806   0.653
## 2 birth_year           5676   0.250
## 3 auth_era_num         5676   0.250
## 4 hammer_price_bp      3898   0.172
## 5 hammer_price_bp_usd  3898   0.172
## 6 auc_desc             3079   0.136
## 7 condition            2808   0.124
## 8 lot_title            2469   0.109

Looking at the table, it actually appears that the estimates (along with its corresponding values generated during pre-processing) are only missing a very small amount of data (less than 1%). Therefore, such a small error can be easily filtered out in future analysis.

Nonetheless, hammer_price is missing nearly 20%. Besides what we saw when looking by location, our guess is that some auction ids are missing the hammer_price_bp in its entirety and hence the difference. It can be easily checked by looking at the aggregate table.

art_price <- art_df[, c("auction_id", "hammer_price_bp")]
percent_missing <- art_price %>% group_by(auction_id) %>%
  summarise(num_na = sum(is.na(hammer_price_bp)), total = n()) %>%
  mutate(percent_na = num_na/total)%>%
  arrange(-percent_na)
percent_missing %>% filter(percent_na>=0.4)
## # A tibble: 10 x 4
##    auction_id num_na total percent_na
##         <int>  <int> <int>      <dbl>
##  1          2    258   258      1.00 
##  2          3    173   173      1.00 
##  3        108     48    90      0.533
##  4        180      8    16      0.500
##  5        100    116   240      0.483
##  6        186      7    15      0.467
##  7        133     47   103      0.456
##  8         30     36    81      0.444
##  9         20    136   335      0.406
## 10         19     14    35      0.400

Looking at the data table, there is a significant number of auctions that are missing more than 10% of the price data (sometimes even 100%). Therefore, it must be that not only the locations but also the auction itself is a determining factor in missing hammer price value.

Lastly, we would like to check for correlation between average hammer_price_bp and percentage of missing values (since it is the one feature that matters the most in our analysis and is most prime to such correlations). Could it be that very high/low lots are simply not reported and, therefore, are missing more?

art_auction <- art_df[, c("auction_id", "hammer_price_bp")] %>% arrange(auction_id)
art_average <-art_auction %>% filter(!is.na(hammer_price_bp)) %>% 
  group_by(auction_id) %>% summarise(mean =mean(hammer_price_bp)) %>% arrange(auction_id)
percent_missing <- percent_missing %>% arrange(auction_id)
add_2 <- data.frame(auction_id=2, mean=0)
add_3 <- data.frame(auction_id=3, mean=0)
art_average <- rbind(art_average, add_2)
art_average <- rbind(art_average, add_3) %>% arrange(auction_id)
percent <- percent_missing[4]
average <- art_average[2]
auction_id <- percent_missing[1]
corr <- data.frame(auction_id=auction_id, percent=percent, average = average)
ggplot(corr, aes(average, percent)) + geom_point(col="blue") +ggtitle("Auction Average Hammer Price vs Percentage of Missing Values") 

There does not seem to be very strong correlation between the two variables.Just a subtle suggestion that auctions with smaller average price tend to have more NAs. Therefore, we will only consider location as the main determinant for missing values around hammer_price_bp.

Before we move on, let us quickly follow up on the action items from the data quality analysis.

  1. filter out NAs for price estimates
  2. take a note to exclude Doha and Hong Kong fro hammer_price_bp analysis
art_final <- art_df %>% filter(!is.na(estimate_low))

Main Analysis (Exploratory Data Analysis)

# dropping rows that have NAs in the hammer price column, which indicates that the lots are not sold
# drop the rows that are from doha, as indicated in the data quality analysis
df1 <- art_df %>% drop_na(hammer_price_bp_usd) 
df2 <- df1 %>%
  filter(abs(df1$hammer_price_bp_usd - 
               median(df1$hammer_price_bp_usd)) <=3*sd(df1$hammer_price_bp_usd)) %>%
  filter(location %in%  c("HONG KONG", "NEW YORK", "LONDON", "PARIS", "MILAN","DUBAI","AMSTERDAM"))

Auction Information

Our research can be divided into three sections: General questions about the data, understanding fluctuations in lot prices and deriving general conclusion.

We started by asking many questions about possible relationships between variables. The first set of plots will explore the correlation between number of lots sold and year, location and season. We were hoping to notice meaningful trends that can be further explored in subsequent sections. Since we had only a few locations, seasons and years we chose a barchart and excluded duplicate rows by the ‘number of lots’ column.

library(gridExtra)
##Lots by location

MyData3 <- subset(art_final, select=c( "location", "number_of_lots" ))
MyData3 <- MyData3[!duplicated(MyData3$number_of_lots),]

MyData4 <- MyData3 %>% group_by(location)%>% summarise(B=sum(number_of_lots)) 

p1 <- ggplot(MyData4, aes(x= location, y = B)) + 
    geom_bar( stat='identity', color="blue", fill="grey")  +labs(x = "Location")+labs(y = "Total number of lots") + ggtitle("Number of lots sold by location") + theme(plot.title = element_text(hjust = 0.5,face="bold")) + theme_minimal() 

MyData_1 <- subset(art_final, select=c( "auc_year", "number_of_lots" ))

MyData_1 <- MyData_1[!duplicated(MyData_1$number_of_lots),]

MyData_1 <- MyData_1 %>% group_by(auc_year)%>% summarise(B=sum(number_of_lots)) 

p2 <- ggplot(MyData_1, aes(x= auc_year, y = B)) + 
    geom_bar( stat='identity', color="yellow", fill="grey")  +labs(x = "Auction year")+labs(y = "Total number of lots") + ggtitle("Number of lots sold by year") + theme(plot.title = element_text(hjust = 0.5,face="bold")) + theme_minimal()+ scale_x_continuous(breaks= c(2006, 2010, 2014, 2017)) 

grid.arrange(p1, p2, nrow = 1)

##Lots by season

MyData3 <- subset(art_final, select=c( "auc_season", "number_of_lots" ))

MyData3 <- MyData3[!duplicated(MyData3$number_of_lots),]

MyData4 <- MyData3 %>% group_by(auc_season)%>% summarise(B=sum(number_of_lots)) 

ggplot(MyData4, aes(x= auc_season, y = B)) + 
    geom_bar( stat='identity', color="orange", fill="grey")  +labs(x = "Auction season")+labs(y = "Total number of lots") + ggtitle("Number of lots sold by season") + theme(plot.title = element_text(hjust = 0.5,face="bold"))  + theme_minimal()

Graphs above suggest that fall and spring are the most popular seasons to acquire a masterpiece. At the same time, there was a significant increase in number of lots sold since 2010, suggesting that investors started to see the art market as a form of long-term non-liquid investment after the financial crisis of 2008. The main hubs for art exchanges formed in London, New York and Paris.

Plotting the number of auctions based on location, season and year turned out to have similar results and confirmed ideas suggested above.

##Auctions by location, year, season

MyData3 <- subset(art_final, select=c( "auction_id", "location" ))

MyData3 <- MyData3[!duplicated(MyData3$auction_id),]

MyData4 <- MyData3 %>% group_by(location)%>% count(auction_id) %>% summarise(B=sum(n)) 


p1<-ggplot(MyData4, aes(x= location, y = B)) + 
    geom_bar( stat='identity', color="green", fill="grey")  +labs(x = "Auction location")+labs(y = "Number of auctions") + ggtitle("Number of auctions by location") + theme(plot.title = element_text(hjust = 0.5,face="bold"))  + theme_minimal()

MyData_1 <- subset(art_final, select=c( "auction_id", "auc_year" ))

MyData_1 <- MyData_1[!duplicated(MyData_1$auction_id),]

MyData_1 <- MyData_1 %>% group_by(auc_year)%>% count(auction_id) %>% summarise(B=sum(n)) 


p2<- ggplot(MyData_1, aes(x= auc_year, y = B)) + 
    geom_bar( stat='identity', color="blue", fill="grey")+labs(y = "Number of auctions")+labs(x = "Year") + ggtitle("Number of auctions by year") + theme(plot.title = element_text(hjust = 0.5,face="bold"))  + theme_minimal()

grid.arrange(p1, p2, nrow = 1)

##Auctions by season

MyData3 <- subset(art_final, select=c( "auction_id", "auc_season" ))

MyData3 <- MyData3[!duplicated(MyData3$auction_id),]

MyData4 <- MyData3 %>% group_by(auc_season)%>% count(auction_id) %>% summarise(B=sum(n)) 

ggplot(MyData4, aes(x= auc_season, y = B)) + 
    geom_bar( stat='identity', color="orange", fill="grey")  +labs(x = "Auction season")+labs(y = "Number of auctions") + ggtitle("Number of auctions by season") + theme(plot.title = element_text(hjust = 0.5,face="bold"))  + theme_minimal()

To see what are the different triggers of variability in art prices we decided to explore fluctuations in Hammer Prices. We created a histogram with a density curve to visualize the distribution of Hammer Price.

##Hammer Price
ggplot(art_final, aes(x= hammer_price_bp_usd)) + geom_histogram(aes(y=..density..))  + geom_density() + xlim(0,1500000) + theme_minimal()

#df2 <- art_final %>%
#  filter(abs(art_final$hammer_price_bp_usd - 
#               median(art_final$hammer_price_bp_usd)) <=3*sd(art_final$hammer_price_bp_usd))

#ggplot(df2, aes(x= hammer_price_bp_usd)) + geom_histogram(aes(y=..density..))  + geom_density() + xlim(0,1500000)+ theme_minimal()

Distribution of Hammer Price is skewed to the right and has a long tail. We believe the reason for this is variability in art geners (Contemporary vs Renaissance for example) sold on the market. Some older masterpieces can be traded at prices that are completly out of range for contemporary artists.

Constracting valuable models in the next parts of our research fully depend on the ability to manipulate hammer price in the right way. We decided to remove a few data points from the tail by considering them as outliers.

##revenue by location

#MyData <- subset(art_final, select=c( "location", "auction_id", "hammer_price_bp_usd" ))


MyData5 <- art_final %>% group_by(location)%>% summarise(B=sum(hammer_price_bp_usd)) 

p1 <- ggplot(MyData5, aes(x= location, y = B)) + 
    geom_bar( stat='identity', color="yellow", fill="grey")+labs(y = "Auction Revenue")+labs(x = "Location") + ggtitle("Auction revenue by location") + theme(plot.title = element_text(hjust = 0.5,face="bold")) + theme_minimal()

##revenue by year

MyData_1 <- subset(art_final, select=c( "auc_year", "auction_id", "hammer_price_bp_usd" ))

MyData_1 <- MyData_1 %>% group_by(auc_year)%>% summarise(B=sum(hammer_price_bp_usd)) 

p2<-ggplot(MyData_1, aes(x= auc_year, y = B)) + 
    geom_bar(stat='identity', color="green", fill="grey")+labs(y = "Auction Revenue")+labs(x = "Year") + ggtitle("Auction revenue by year") + theme(plot.title = element_text(hjust = 0.5,face="bold")) + theme_minimal()

grid.arrange(p1, p2, nrow = 1)

Adjusted hammer price brought our attention to distribution of revenue over time and location. For both plots we chose a bar chart. In general, auction revenues are going up, with highest revenues in a decade being in 2017. London and New York continue to lead the way as the main centers for the exchange of art.

##Clevelend dot plot
MyData2 <-  strtrim(art_final$auc_title, 30)
art_final$auc_title <- MyData2

MyData1 <- art_final[!duplicated(art_final$auc_title),]

MyData3 <- subset(MyData1, select=c("auc_title", "number_of_lots"))

MyData3$auc_title <- factor(MyData3$auc_title, levels = MyData3$auc_title[order(MyData3$number_of_lots)])

MyData3<-  MyData3[which(MyData3$number_of_lots>200),]

ggplot(MyData3, aes( x = auc_title, y = number_of_lots)) + geom_point(stat="identity",  color="red") + coord_flip()+ theme_minimal()

MyDatas <-subset(art_final, select=c( "nth_in_auction", "auc_year" , "birth_year", "hammer_price_bp_usd", "location"))


library(GGally)
ggparcoord(MyDatas, columns = 1:(ncol(MyDatas)-1), scale = "uniminmax", groupColumn = "location", alphaLines =0.2) 

#devtools::install_github("timelyportfolio/parcoords")
library(parcoords)
#library(httpuv)

parcoords(MyDatas,
    rownames = F 
    , brushMode = "1D-axes", color = list(
      colorBy = "location", colorScale = htmlwidgets::JS("d3.scale.category10()"))
    ) 

Do Certain Lot Attributes Result in Higher Price?

Lot Titles

What lots have higher price?

df_wordcloud <- df1[,c("lot_title","hammer_price_bp_usd")]
df_wordcloud <- arrange(df_wordcloud,desc(df_wordcloud$hammer_price_bp_usd))[1:500,]
library(wordcloud)
library(tm)
pal <- brewer.pal(9, "OrRd")
pal <- pal[-(1:3)]
wordcloud(df_wordcloud$lot_title, df_wordcloud$hammer_price_bp_usd, min.freq=500, scale=c(5, .5), random.order = FALSE, random.color = FALSE, colors= pal)

#### What words appear more often in the lot titles?

# collapse the lot_title column by word and count the frequency they appear in the titles.
temp <- paste(df1$lot_title, collapse=' ' )
temp <- tolower(temp)
temp <- gsub(" *\\b[[:alpha:]]{1}\\b *", " ", temp)
temp <- gsub('[[:punct:] ]+',' ',temp)
temp <- as.list(strsplit(temp, " "))
temp <- unlist(temp)[!(unlist(temp) %in% stopwords("english"))]
temp <- unlist(temp)[!(unlist(temp) %in% "na")]
word_count <- na.omit(as.data.frame(table(temp)))
word_count <- arrange(word_count,desc(word_count$Freq))[1:300,]

# visualize word frequencies
pal <- brewer.pal(9, "Dark2")
wordcloud(word_count$temp, word_count$Freq, min.freq =20, scale=c(5, .5), random.order = FALSE, random.color = FALSE, colors= pal)

#### Looking at lots that have name “Untitled”, what price ranges are they in? Is it correlated?

library(vcd)
df1 <- df1 %>% 
  dplyr::mutate(hammer_price_bp_usd_range = forcats::fct_relevel(hammer_price_bp_usd_range, "<$50,000"))%>%
  dplyr::mutate(hammer_price_bp_usd_range = forcats::fct_relevel(hammer_price_bp_usd_range, "<$500,000"))%>%
   dplyr::mutate(hammer_price_bp_usd_range = forcats::fct_relevel(hammer_price_bp_usd_range, "$500,000+"))

vcd::mosaic(hammer_price_bp_usd_range~is_untitled, direction = c("v", "h"),df1,
       gp = gpar(fill = c("lightyellow", "lightpink")),
                 labeling = labeling_border(rot_labels = c(0, 45),pos_labels="center"))

#### Are they mostly contemprary?

#untitle_labels = read.csv("untitle_ratio.csv", header=TRUE)

Does the era of the lot affect its price?

df3 <- df2 %>% 
  filter(df2$birth_year>1800)
ggplot(df3, aes(birth_year,hammer_price_bp_usd))  + 
  geom_smooth(method='lm',formula=y ~ poly(x, 2))+
  geom_point(alpha = .1)  + 
  theme_grey(10)+scale_y_log10()+  geom_density_2d(bins = 5)

#box plot price vs year
#ggplot(df3, aes(x=auth_era), y=hammer_price_bp_usd)+ stat_summary(fun.y="mean", geom="line", aes(group=1))
ggplot(df3, aes(auth_era, hammer_price_bp_usd)) +
  geom_boxplot()+theme(axis.text.x = element_text(angle = 90, hjust = 1))+scale_y_log10()

## Do Certain Extrernal Factors Result in Higher Price? ###Does the Order Matter?

ggplot(df2, aes(percent_in_auction,hammer_price_bp_usd))  + 
  geom_smooth(method='lm',formula=y~x,color="red")+
  geom_point(alpha = .05)  + 
  theme_grey(10)+scale_y_log10()+
  facet_wrap(~location)

### Is there an impact from the financial crisis?

df1$auc_ymd <- as.Date(df1$auc_year_month_date)
art_yearfin <- df1 %>% group_by(month=lubridate::floor_date(auc_ymd, "month")) %>% summarise(revenue = mean(hammer_price_bp_usd))
ggplot() +geom_line(data=art_yearfin, aes(x=month, y=revenue/1000000))+ggtitle("Financial Crisis' Effect on average lot price ($M)")+ylab("average price")+xlab("Time")+theme(axis.text.x = element_text(angle = 45, hjust = 1))+scale_y_log10()

Let’s start by looking at the average lot prices of Sothebeys on a yearly scale. Our guess is that we are supposed to see a significant drop around the time of the financial crisis.

art_finance <- art_final[c("auc_year", "auc_month", "location","hammer_price_bp_usd")] %>% filter(!is.na(hammer_price_bp_usd))
art_yearfin <- art_finance %>% group_by(auc_year) %>% summarise(revenue = sum(hammer_price_bp_usd))
ggplot(art_yearfin, aes(x=auc_year, y=as.numeric(revenue/1000000))) +geom_line(col="blue")+ggtitle("Financial Crisis' Effect on Auction Revenue ($M)")+ylab("Revenue")+xlab("Year")

Indeed, we are seeing a big deep around late 2010. It is surprising to see that it took some time for the effect to reach the auction houses. Perhaps it is certain locations that are skewing some of the data. Let us try to facet the data by location and see if that could present us with a better outlook.

art_locfin <- art_finance %>% group_by(auc_year, location) %>% summarise(revenue = sum(hammer_price_bp_usd))
ggplot(art_locfin, aes(x=auc_year, y=as.numeric(revenue/1000000))) +geom_line(col="blue")+ggtitle("Financial Crisis' Effect on Auction Revenue ($M)")+ylab("Revenue")+xlab("Year") + facet_wrap(~location, scales ="free_y")

It looks like, indeed, some locations do not have consistent data and might be throwing the calculation off (Dubai, Doha, and Amsterdam). Let us tke a look at the same chart without those locations.

art_locfin <- art_finance %>% group_by(auc_year, location) %>% summarise(revenue = sum(hammer_price_bp_usd)) %>% filter(location %in%  c("HONG KONG", "NEW YORK", "LONDON", "PARIS", "MILAN"))
ggplot(art_locfin, aes(x=auc_year, y=as.numeric(revenue/1000000))) +geom_line(col="blue")+ggtitle("Financial Crisis' Effect on Auction Revenue ($M)")+ylab("Revenue")+xlab("Year") + facet_wrap(~location, scales ="free_y")

We notice a consistent drop in revenue across all locations above starting 2010. Therefore, our hypothesys must be correct: the financial crisis did have an effect on the auction revenue across the world (specifically significant drops are observed in New York and Hong Kong).

Does Season Matter?

vcd::mosaic(hammer_price_bp_usd_range~auc_season, direction = c("v", "h"),df1,
       gp = gpar(fill = c("lightyellow", "lightpink")),
                 labeling = labeling_border(rot_labels = c(0, 90),pos_labels="center"))

###Does Location Matter? # Executive Summary

Interactive Componenet

Link to the interactive part: https://edav-art-viz.firebaseapp.com

Description

This view allows users to see all of the lots sold by their price and timeframe. Each color also represents a different location. This visualization supports the project’s main hypothesis that location and season significantly affect the auction performance.

Instructions

There are quite a few great aspect of this chart that could be very helpful for the user:

  1. Change Year - Click on the arrows next to the year to either move a year forward or backwards. This can help to compare the different distributions across all years in our dataset.
  2. See Lot Details - Hoover over each circle to see the details of the lot transaction. You should expect to see a window that is showing the hammer price as well as the picture of the lot. Keep in mind that a lot of the lots have their pictures protected by the copyright (so they will not show). Nonetheless, some of the later ones typicaally do:)
  3. Spread Data Points - to better see each data point for the details use the button on the top right that says: “Spread Data Points.” This would jitter the points from their original location to better see all options. One can easily bring the points back to proper location by unclicking the button.